Abstract

Although  there  are  many  situations  in  which  a  model  of  application  performance  is  valuable, 
performance  modeling  of  parallel  programs  is  not  commonplace,  largely  because  of  the  difficulty 
of  developing  accurate  models  of  real  applications  executing  on  real  multiprocessors.  This  paper 
describes  a  toolkit  for  performance  tuning  and  prediction  based  on  lost  cycles  analysis.  Lost  cycles 
analysis  decomposes  parallel  overheads  into  meaningful  categories  that  are  amenable  to  modeling, 
and  uses  a  priori  knowledge  of  the  sources  and  characteristics  of  overhead  in  parallel  systems 
to  guide  and  constrain  the  modeling  process.  The  Lost  Cycles  Toolkit  automates  the  process  of 
constructing  a  performance  model  for  a  parallel  application  by  integrating  empirical  model-building 
techniques  from  statistics  with  measurement  and  modeling  techniques  for  parallel  programs.  We 
present  several  examples  to  show  how  the  toolkit  facilitates  the  construction  of  performance  models, 
and  to  illustrate  the  use  of  the  toolkit  in  solving  practical  performance  problems. 
1  Introduction


Parallel programmers are faced with significant challenges in developing efficient programs. Parallel
programs often behave in unexpected ways due to the complex relationship between the structure
of a parallel program, the machine on which it is run, the number of processors used, the program's
input, and the running time of the program. As a result, performance tuning of parallel
programs is an error-prone and time-consuming process.


Performance prediction offers significant benefits when programming for efficiency on parallel
machines. The ability to predict performance helps avoid costly and error-prone experiments that
consume valuable resources. Using a performance model of an application, programmers can determine
the strengths and weaknesses of alternative implementations, and the scalability of current implementations.


Despite these benefits, performance modeling of parallel programs is not commonplace, largely
because of the difficulty of developing accurate models of real applications executing on real
multiprocessors. Since parallel overhead can arise from many sources, performance modeling requires
an analytical understanding of a wide range of factors spanning hardware and software. Properly
and accurately measuring parallel overhead is painstaking and time consuming. The possibility of
interactions among factors requires an understanding of how to design experiments to isolate those
interactions.


To  address  these  difficulties,  a  variety  of  practical  performance  prediction  techniques  have  been 
proposed.  These  techniques  can  be  characterized  by  their  relative  reliance  on  static  versus  dynamic 
data.  Static  methods  rely  primarily  on  source  code  analysis  to  develop  predictions.  These  methods 
have  the  advantage  of  requiring  minimal  machine  or  application  measurements,  but  may  sacrifice 
accuracy due to the limitations of the current state of the art in code analysis. As a result, static
methods  tend  to  be  limited  to  particular  languages,  programming  styles,  or  machine  types.  Within 
these  constraints,  several  static  methods  have  achieved  good  accuracy  (e.g.,  [Clement  and  Quinn, 
1993]). 

Additional  dynamic  information  is  used  in  training  sets  (e.g.,  [Pease  et  al.,  1991])  and  template- 
based  approaches  (e.g.,  [Zimran  et  al.,  1990]).  Training  sets  use  extensive  machine  measurements 
to  characterize  frequently-occurring  instruction  patterns.  The  measurements  provide  accuracy,  but 
the training sets need to be quite extensive and can be difficult to maintain. Template-based
approaches subsume most of the static analysis task in the selection of a program template;
complex performance models are constructed in advance and implicitly selected when a template
is chosen. The performance models are cast in terms of basic machine parameters to minimize
the need for dynamic measurements. The template-based approach requires the user to match the
application  to  a  pre-existing  template,  and  so  limits  the  technique’s  applicability  to  specific  classes 
of  applications. 

Scalability  analysis  is  a  completely  different  approach  to  performance  prediction,  where  the  goal 
is  to  characterize  combinations  of  machine  and  application  with  respect  to  their  efficiency  [Grama 
et  al.,  1993].  Scalability  analysis  develops  analytic,  asymptotic  models  of  computation  and  selected 
overhead  categories  as  a  function  of  the  size  of  the  problem  and  the  number  of  processors.  Although 
these  analyses  provide  insight  into  the  inherent  scalability  of  a  particular  application  and  machine 
combination,  they  cannot  be  used  directly  for  performance  prediction  because  of  their  reliance  on 
asymptotic  analysis. 

In  this  paper  we  describe  a  toolkit  for  performance  prediction  that  offers  the  predictive  power  of 
scalability  analysis  and  the  accuracy  of  training  sets  or  templates.  In  contrast  to  training  sets  and 
templates,  our  approach  is  based  mainly  on  dynamic  measurements  of  application  programs,  so  that 
our techniques are broadly applicable, and machine- and language-independent. Like scalability
analysis, we divide parallel overheads into categories that are amenable to modeling. We call the
process of building performance models using these overhead categories lost cycles analysis
[Crovella and LeBlanc, 1994]. The Lost Cycles Toolkit, which automates the process of constructing a
performance model for a parallel application, uses a priori knowledge of the sources and characteristics
of the overhead categories in parallel systems to guide and constrain the modeling process. The
toolkit integrates empirical model-building techniques from statistics [Atkinson and Donev, 1992;
Box and Draper, 1987; Box et al., 1978] with measurement and modeling techniques for parallel
programs.

Our primary goal for the toolkit is to offer sufficient ease of use and generality that performance
modeling can become a standard tool for programmers during parallel program development. In
this respect our work differs from [Brewer, 1995], which also fits empirically measured data to
mathematical models, in that his focus is on the use of models to select from among alternative
library  implementations.  Another  similar  approach  is  used  in  [Toledo,  1995],  but  that  work  is 
restricted  to  SIMD-style  applications.  To  achieve  our  goal,  the  prediction  method  should  require  as 
few  experiments  as  possible,  the  models  produced  by  the  toolkit  must  be  accurate  enough  to  be  used 
in  a  wide  variety  of  scenarios,  and  the  models  must  apply  to  all  performance  factors  simultaneously 
(rather  than  a  single  factor,  such  as  the  number  of  processors). 

The  Lost  Cycles  Toolkit  achieves  these  goals  as  follows.  First,  we  define  a  set  of  overhead 
categories  for  parallel  programs  that  assists  in  analysis,  and  provide  a  tool  (pp)  that  measures 
the  overhead  in  each  category  at  runtime.  We  ensure  efficient  experimentation  by  incorporating 
experimental  design  methodology  into  a  tool  for  automatic  experiment  generation  (expgen).  We  use 
the  performance  measurement  results  obtained  from  the  experiments  to  select  from  standard  models 
for  overhead  categories,  using  a  tool  for  fitting  models  of  overhead  categories  to  experimental  data 
(lea).  To  ensure  accuracy  in  the  resulting  performance  models,  we  submit  poorly-fitting  model 
components  to  the  user  for  iterative  improvement  using  the  tool  modgen.  The  entire  process  is 
coordinated  via  a  graphical  interface  provided  by  the  tool  xmodgen. 

The  remainder  of  the  paper  describes  the  Lost  Cycles  Toolkit  in  detail,  showing  first  how  the 
toolkit  assists  in  constructing  performance  models,  and  then  illustrating  how  those  models  can  be 
used  in  performance  tuning. 

2  Lost  Cycles  Analysis 

Lost  cycles  analysis  [Crovella,  1994;  Crovella  and  LeBlanc,  1994]  is  an  attempt  to  automate  the 
construction of performance models of an application as a function of runtime factors. The most
commonly  investigated  runtime  factors  are  the  number  of  processors  used  (p)  and  the  size  of  the 
input  data  (n).  However,  other  runtime  factors  can  have  a  significant  effect  on  parallel  performance, 
and  lost  cycles  analysis  is  equally  suited  to  investigating  them.  For  example,  the  sparseness  of  an 
input  matrix,  the  proportion  of  bright  pixels  in  an  image,  or  the  connectivity  of  an  input  graph  all 
are significant runtime factors; we have shown in previous work [Crowl et al., 1994] that factors
such  as  these  can  be  quite  important  in  finding  the  proper  implementation  for  a  parallel  program. 

Table 1: Typical Functional Forms for Overhead Categories

    Category   Variable   Typical Form
    LI         n          k1 n + k2
    LI         p          k1 p^[?]/p + k2
    IP         n          k1
    IP         p          k1 p
    SL         p          k1 log p + k2
    CL         n          k1 n
    CL         p          k1 p + k2
    RC         p          max(k1 p + k2, k3)

To model the performance of a parallel program we need to construct expressions for the
application's pure computation Tc and parallel overhead To. Pure computation is the work the application
must perform to solve its problem; it corresponds to the cost of the best serial implementation of
the application. Parallel overhead consists of all additional processor-time spent by the parallel
implementation. The basis for constructing a performance model is the formula

$$T(r) = \frac{T_c(r) + T_o(r)}{p}$$

in which T is the running time of the application as a function of the runtime factors r. If r consists
of only one factor (such as p) then T is a univariate model. More commonly, r consists of p, n, and
perhaps other factors; in this case T is a multivariate model.

Values  for  Tc  can  be  obtained  directly  from  uniprocessor  executions,  or  indirectly  from  the  values 
for  To  and  the  running  time  T.  Alternatively,  an  expression  for  Tc  can  be  derived  using  standard 
complexity  analysis  augmented  by  experiments  to  obtain  the  constants.  Finding  a  mathematical 
expression that accurately describes To is more difficult, however, because overheads can arise in
many ways in a parallel system. Lost cycles analysis attacks this difficulty by decomposition of
parallel overhead. Parallel overhead is decomposed into a set of categories that are intuitively
meaningful, mutually exclusive, and complete. The set of categories expresses overheads in common
units (i.e., lost cycles or LC). Thus, we can directly compare overhead in one category to overhead
in another category, since they both correspond to time (i.e., cycles) during which no useful work is
done. We usually express LC as its sum over all processors in an execution; this sum is equivalent
to the value To. Although overheads are not independent (so we cannot in general vary one without
changing another) the effect of changing LC in an execution is the same no matter what specific
kind of overhead is actually changing. Thus, LC forms a single uniform metric for comparing
different kinds of overhead.
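
As a small illustration of how the uniform LC metric feeds the running-time formula T = (Tc + To)/p, the following sketch sums lost cycles across categories to obtain To. The function name and all numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from this work.

```python
# Illustrative sketch only: the numbers below are invented, not measured.

def running_time(pure_computation, lost_cycles, processors):
    """Return T = (Tc + To) / p, with To the sum of lost cycles over all categories."""
    total_overhead = sum(lost_cycles.values())  # To, already summed over all processors
    return (pure_computation + total_overhead) / processors

# Hypothetical processor-seconds lost in each overhead category during one execution.
lost = {"LI": 3.0, "IP": 1.0, "SL": 0.5, "CL": 2.5, "RC": 1.0}

t = running_time(pure_computation=40.0, lost_cycles=lost, processors=8)
print(t)  # (40 + 8) / 8 = 6.0
```

Because every category is expressed in the same units, removing one processor-second of communication loss has the same effect on T as removing one processor-second of load imbalance.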

The  decomposition  of  parallel  overhead  into  categories  facilitates  performance  modeling  for  two 
reasons.  First,  the  decomposed  overheads  can  be  modeled  more  easily  than  the  total  overhead. 
Second,  each  category  of  overhead  has  certain  characteristic  behaviors  it  often  follows,  resulting  in 
simple  models  in  many  cases.  Lost  cycles  analysis  exploits  these  two  observations  about  decomposed 
overheads. 

The  ease  of  lost  cycles  analysis  arises  from  the  ability  to  account  for  most  categories  of  parallel 
overheads  using  default  models.  These  default  models  can  be  incorporated  into  a  tool  for  fitting 
models  to  experimental  data.  Examples  of  the  kinds  of  default  models  incorporated  into  our  toolkit 
are  shown  in  Table  1.  The  table  is  representative  and  does  not  include  all  typical  behaviors;  the 
complete  list  is  embodied  in  the  toolkit,  currently  with  at  least  three  typical  behaviors  per  category. 

The  overhead  categories  of  interest  may  vary  among  machines  or  applications.  The  entries 
in  Table  1  focus  on  the  following  categories:  load  imbalance  (LI),  insufficient  parallelism  (IP), 
synchronization loss (SL), communication loss (CL), and resource contention (RC). We also refer
to  the  remaining  time  (which  represents  computation)  as  RT,  and  the  total  execution  time  as  TT. 

In  previous  work  [Crovella  and  LeBlanc,  1993]  we  showed  how  to  measure  lost  cycles,  using 
performance  predicates  to  specify  each  category,  and  a  predicate  profiler  to  measure  the  time  spent 
in  each  category.  We  have  implemented  predicate  profiling  for  Fortran  programs  on  the  Kendall 
Square KSR1, C programs on the Silicon Graphics Challenge machine, and Fortran and C programs
running under PVM on a network of SUN workstations. The remainder of this paper will illustrate
how we use those measurements to develop performance models of applications.


3  Building Performance Models with the Toolkit

In  this  section  we  describe  how  to  generate  a  performance  model  for  a  parallel  application  using 
the Lost Cycles Toolkit. We use one-dimensional FFT (1D FFT) on the KSR1 as an example
application to illustrate each of the steps in model development. In our implementation of 1D
FFT  we  use  a  2-stage  pipeline,  where  the  first  stage  consists  of  input  generation,  matrix  transpose, 
FFT,  and  matrix  scaling,  and  the  second  stage  consists  of  matrix  transpose,  FFT,  transpose,  and 
output  checking.  Each  stage  is  allocated  half  of  the  available  processors  and  uses  data  parallelism 
to  exploit  its  processors. 

Creation  of  a  performance  model  for  an  application  using  the  Lost  Cycles  Toolkit  involves  the 
following  steps: 

1.  The  user  designs  experiments  to  capture  the  necessary  overhead  measurements.  Based  on 
user  input,  the  tool  expgen  determines  the  necessary  application  executions  and  creates  a 
program  script  for  execution. 

2.  The user runs the experiments by executing the script; the predicate profiler (pp) automatically
extracts the overhead measurements from each execution.

3.  Using the program measurements as input, the lea tool finds the best-fitting default univariate
model for each overhead category. If no default model is sufficiently accurate for a
category,  the  user  is  alerted  by  the  xmodgen  interface,  and  given  the  ability  to  suggest  an 
alternative  model. 

4.  Using  the  modgen  tool,  the  user  combines  the  univariate  models  into  the  final,  multivariate 
performance  model  of  the  application.  Sources  of  inaccuracy  in  the  final  model  can  be  traced 
through  the  xmodgen  interface  back  to  particular  categories  or  choices  of  experiments,  allowing 
the  user  to  adjust  models  or  compensate  for  noisy  data. 

As  described  above,  the  tool  is  intended  to  be  used  in  a  highly  interactive  fashion.  At  each 
step  in  the  modeling  process,  the  user  may  provide  information  that  facilitates  or  constrains  the 
modeling  process,  reducing  the  need  for  experiments,  increasing  the  accuracy  of  the  models,  or 
improving  the  descriptive  value  of  the  models.  Such  information  may  derive  from  an  analysis  of 
the  application,  or  previous  experience  with  application  modeling.  In  those  cases  where  the  user 
cannot  provide  the  appropriate  information,  the  toolkit  uses  reasonable  default  settings  to  produce 
a  model  automatically. 
3.1  Experiment Design and Generation


In  developing  performance  models  from  measured  data,  there  is  a  tradeoff  between  the  time  and 
effort required to gather the data and the accuracy of the resulting models. Our approach is
based  on  the  optimum  experimental  design  techniques  presented  in  [Atkinson  and  Donev,  1992; 
Jain,  1991]. 

Our starting point is an instrumented program for 1D FFT. The performance of the program
can be characterized by two factors: P (the number of processors) and D (the size of the input array).
The levels for these factors (that is, the valid values each factor can assume) are defined by the
program and the architecture. Since we use a two-stage pipeline in the implementation of 1D FFT,
P can be any even number of processors up to the limit of the architecture. The levels for D as
used in the program have the form 2^N, where N ranges from 4 to 9 (i.e., 16, 32, 64, 128, 256 and
512).

The first step in experimental design is to choose the experiment levels for each factor; that
is, the subset of possible levels for each factor that are representative of the experimental design
space  as  a  whole.  In  general,  we  should  choose  at  least  four  levels  for  each  factor,  because  our 
formulae  are  at  most  second-order  (requiring  3  points  for  a  unique  fit)  and  because  we  want  some 
goodness-of-fit  feedback  (provided  by  the  fourth  point).  Additional  points  give  a  more  accurate 
measure  of  goodness-of-fit.  For  this  example,  we  chose  P=[4,  8,  16,  24]  and  D  =[64,  128,  256,  512]. 

The  actual  experiments  to  be  performed  correspond  to  some  combination  of  the  various  levels 
for  each  factor.  In  factorial  design  [Atkinson  and  Donev,  1992],  an  experiment  is  performed  for 
all  possible  combinations  of  the  levels  for  P  and  D,  which  would  require  16  experiments  in  our 
example.  While  this  approach  provides  broad  coverage  of  the  experiment  space,  it  can  be  very 
time-consuming.  A  fractional  (or  reduced)  factorial  design  attempts  to  provide  reasonable  coverage 
of the experiment space, while requiring significantly fewer experiments. In our case we use a
cross-tuple approach based on a subset of the levels for each factor, called instance levels. For each
instance level, we generate experiments containing that instance level and all other possible levels
for the other factors. In our example, this cross-tuple approach requires 12 experiments instead of
16.^1

In order to generate experiments using the expgen tool, the user must supply the factors of
interest, and the levels for each factor. The user may choose to supply the instance levels, which
determine the actual experiments to be performed, or simply the number of experiments desired.
The user can also supply a replication count, which determines the number of times each experiment
is performed. The toolkit assumes a default replication count of 2; more redundancy in experiments
increases the accuracy of the results. From these values, the experiment generation tool (expgen)
generates the actual experiments.

In our example, expgen used instance levels P=[8, 24] and D=[128, 512], and a default replication
count of 2. The output of expgen is a program script containing the 12 experiments to
be performed. Each of these experiments is executed twice (possibly overnight, if the experiments are
time-consuming), producing predicate profiling information that is stored for use during model
generation.
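
The count of 12 experiments can be checked with a short sketch. This is not the expgen implementation; it encodes one reading of the cross-tuple approach, in which each cross-tuple fixes one instance level per factor and each factor is then varied in isolation across all of its levels. The function name `cross_tuple_experiments` is hypothetical.

```python
from itertools import product

def cross_tuple_experiments(levels, instance_levels):
    """Enumerate experiments for a cross-tuple design (one interpretation, see lead-in).

    For every cross-tuple (one instance level per factor), vary each factor over
    all of its levels while the other factors stay at the tuple's instance levels.
    """
    factors = list(levels)
    experiments = set()
    for tuple_values in product(*(instance_levels[f] for f in factors)):
        base = dict(zip(factors, tuple_values))
        for varied in factors:
            for level in levels[varied]:
                point = dict(base)
                point[varied] = level
                experiments.add(tuple(point[f] for f in factors))
    return sorted(experiments)

levels = {"P": [4, 8, 16, 24], "D": [64, 128, 256, 512]}
instance_levels = {"P": [8, 24], "D": [128, 512]}

experiments = cross_tuple_experiments(levels, instance_levels)
full = list(product(levels["P"], levels["D"]))
print(len(experiments), "of", len(full))  # 12 of 16
```

The four omitted points are those in which neither P nor D sits at an instance level, such as (P=4, D=64).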


^1 The number of experiments saved using a reduced-factorial design depends on the number of factors, so the
savings  could  be  much  greater  than  in  this  example. 
3.2  Model Generation


We  use  the  measurements  produced  by  the  experiment  design  and  generation  steps  as  input  to  the 
model  generation  tool  modgen  to  create  a  model  for  the  application’s  performance.  Model  generation 
takes  place  in  two  steps.  We  first  generate  a  univariate  model  for  each  category  of  overhead  that 
captures  the  effects  of  varying  each  factor  in  isolation.  We  then  generate  a  multivariate  model  for 
each overhead category that captures the effects of varying all factors in the experimental design
space. The model for the application is simply the sum of these overhead models.

Univariate model generation is straightforward. The lea tool generates a model for each category
of overhead by fitting measured data to standard models of overhead. Our toolkit finds the
best model for a category using a direct method of general linear least squares minimization based
on Singular Value Decomposition (SVD) [Press et al., 1988].^2 The metric used for fitting is the
determination coefficient R^2, which ranges between 0 and 1, with 1 being the best possible fit.
The determination coefficient is the fraction of the total variance that is explained by the fit [Jain,
1991].
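
The fitting step can be illustrated with a minimal sketch that fits two candidate forms and keeps the one with the higher determination coefficient. Two things here are assumptions rather than the toolkit's behavior: the sketch solves the normal equations by Gaussian elimination instead of using SVD (to stay within the standard library), and the candidate forms and data are synthetic.

```python
import math

def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit(basis, xs, ys):
    """Return least-squares coefficients and R^2 for y ~ sum_j k_j * basis[j](x)."""
    rows = [[phi(x) for phi in basis] for x in xs]
    k = len(basis)
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    coeffs = solve(ata, aty)
    predicted = [sum(c * v for c, v in zip(coeffs, r)) for r in rows]
    mean = sum(ys) / len(ys)
    ss_res = sum((y - q) ** 2 for y, q in zip(ys, predicted))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return coeffs, 1.0 - ss_res / ss_tot  # R^2: fraction of variance explained

# Two candidate forms, each linear in its coefficients k1 and k2.
candidates = {
    "k1*p + k2": [lambda p: float(p), lambda p: 1.0],
    "k1*log(p) + k2": [lambda p: math.log(p), lambda p: 1.0],
}

# Synthetic data generated from 3*log(p) + 1 at four levels of p.
ps = [4, 8, 16, 24]
ys = [3.0 * math.log(p) + 1.0 for p in ps]

results = {name: fit(basis, ps, ys) for name, basis in candidates.items()}
best = max(results, key=lambda name: results[name][1])
print(best, results[best])
```

With four levels per factor, three points determine a second-order form and the fourth provides the goodness-of-fit feedback that separates the two candidates.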

For  each  cross-tuple,  modgen  generates  a  univariate  model  for  each  factor  by  calling  lea.  For 
example,  one  of  the  univariate  models  we  generate  describes  how  load  imbalance  varies  as  a  function 
of  P  when  D=128.  Another  univariate  model  describes  how  contention  varies  as  a  function  of  D 
when  P=8.  modgen  determines  which  models  to  generate,  and  calls  lea  to  fit  a  model  to  a  subset 
of  the  measured  data  points. 

In  multivariate  model  generation,  modgen  combines  and  refits  the  univariate  models.  Each 
multivariate  model  expresses  interactions  (or  lack  thereof)  among  the  various  factors.  If  two  factors 
do  not  interact,  then  a  reasonable  form  for  the  multivariate  model  is  the  sum  of  the  univariate 
models. If two factors do interact, then the multivariate model includes the product of the univariate
models. modgen compares the fit for these alternative combinations of univariate models to the
measured  data  to  determine  the  form  of  the  resulting  combined  model. 
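
The choice between a sum and a product of univariate models can be sketched as follows. This is a simplified stand-in for what modgen does, not its implementation: each combined form is refit with a single scale and offset by simple linear regression, and the univariate shapes `f` and `g`, the function names, and the data are all hypothetical.

```python
def linear_fit(zs, ys):
    """Fit y ~ a*z + c by simple linear regression and return (a, c, R^2)."""
    n = len(zs)
    mz, my = sum(zs) / n, sum(ys) / n
    szz = sum((z - mz) ** 2 for z in zs)
    szy = sum((z - mz) * (y - my) for z, y in zip(zs, ys))
    a = szy / szz
    c = my - a * mz
    ss_res = sum((y - (a * z + c)) ** 2 for z, y in zip(zs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, c, 1.0 - ss_res / ss_tot

def combine(f, g, points, ys):
    """Compare the additive and multiplicative combinations of two univariate models."""
    forms = {
        "sum": [f(p) + g(d) for p, d in points],      # factors do not interact
        "product": [f(p) * g(d) for p, d in points],  # factors interact
    }
    fits = {name: linear_fit(zs, ys) for name, zs in forms.items()}
    best = max(fits, key=lambda name: fits[name][2])
    return best, fits

# Hypothetical univariate shapes for one overhead category.
f = lambda p: float(p)
g = lambda d: float(d)

# Synthetic measurements in which P and D interact.
points = [(p, d) for p in (4, 8, 16, 24) for d in (128, 512)]
ys = [0.01 * p * d + 2.0 for p, d in points]

best, fits = combine(f, g, points, ys)
print(best)  # product
```

A fuller version would refit every coefficient of the combined form against all measured points rather than a single scale and offset.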

Using  this  approach,  we  are  able  to  produce  a  model  for  each  cross-tuple,  by  combining  all  the 
univariate  models  for  that  tuple  according  to  the  best  fit  for  their  interactions.  Since  the  number 
of  cross-tuples  is  the  product  of  the  number  of  instance  levels  for  each  factor  (and  therefore  usually 
greater  than  one),  this  approach  will  produce  several  multivariate  models  that  must  be  combined 
into  one.  We  incrementally  combine  these  models  as  they  are  generated  within  modgen,  using 
the  best  fit  among  the  two  models  or  their  sum.  modgen  uses  all  the  measured  data  points  in 
determining  this  fit. 

The  user  can  evaluate  and  improve  the  models  generated  by  the  toolkit  using  an  X  interface 
produced  by  the  xmodgen  tool.  The  toolkit  supports  hierarchical  viewpoints  of  the  performance 
modeling  process;  some  windows  describe  the  model  itself,  others  describe  the  detailed  information 
used  to  generate  the  model.  We  illustrate  the  interface  and  the  hierarchical  viewpoints  using  our 
FFT  example,  as  shown  in  Figure  1. 

The Fitting Overview window (Figure 1a) is the starting point of the evaluation. This window
shows the average determination coefficient for each univariate model representing a factor and
overhead category pair (along the top row of values), and for the multivariate model (the bottom row
of values). Values close to 1.0 (e.g., insufficient parallelism as a function of D or P, communication
loss as a function of P) represent accurate models; smaller values (e.g., synchronization loss as a
function of D) represent inaccurate models. (Note that in this display we treat the time remaining
after subtracting out all sources of overhead, rt, as simply another category for modeling purposes.)

^2 Direct methods such as SVD are limited to fitting curves that are linear in their coefficients. Thus direct
methods can find the values for the coefficients (k's) in expressions such as k1 x log(x) + k2 x^[?] but cannot
typically do so for expressions such as [...]. To solve nonlinear expressions it is usually necessary to use iterative
methods, but fortunately the expressions used in modeling parallel overheads are normally relatively simple and are
characteristically linear in their coefficients.

In this example, the user first wants to know the cause of the low R^2 value for synchronization
loss (0.35) as a function of D. Clicking the mouse on the determination coefficient of interest
produces the Fitting window (Figure 1b). This window shows the R^2 value associated with each
of the instances of P used to generate the univariate model. From this window we can see that the
results for P=8 are poor, and the results for P=24 are terrible (0.54 and 0.17, respectively).

Clicking on the determination coefficient button for P=24 in this window produces the Fit
window for synchronization loss when P=24 (Figure 1c). This new window shows the two default
linear models for synchronization loss, the constants for those models produced by fitting the
measured data to the models, and the corresponding R^2 values. From the poor R^2 values, we can
conclude that the default models for synchronization loss stored in the toolkit are inadequate for
modeling synchronization loss in this application.

Clicking  on  the  “Graph”  button  at  the  bottom  of  this  window  produces  the  Ghostview  graph 
(Figure 1d) of these two models and the measured data. This graph not only confirms that
the  default  formulae  are  inadequate,  but  also  indicates  that  the  measurements  exhibit  significant 
variance.  The  user  can  reduce  the  variance  by  increasing  the  replication  count  to  5  (recall  the 
default  is  2)  and  repeating  the  experiment  generation  step. 

By  clicking  the  “Formulae”  button  at  the  bottom  of  the  Fit  window,  the  user  can  edit  the 
standard  models  file  to  include  additional  models  of  overhead  for  a  given  category.  These  additional 
models  may  be  derived  by  exploiting  knowledge  of  the  source  code,  application  structure,  or  by 
a visual fit of the measured data points. The editor window (Figure 1e) at the bottom of the
figure illustrates this process. In this case, we added the formula k1 log(x) x^2 + k2 based on the
observation that the 1D FFT program applies the transform log(d) times to a matrix of size d^2.

After  performing  a  similar  type  of  analysis  in  each  of  the  cases  where  the  generated  model  has 
a  low  determination  coefficient,  we  can  repeat  the  whole  process  from  the  experiment  generation 
step. Continuing with our example, the next iteration of modeling produces the results in Figure
2. As can be seen in this figure, the variance in the data in the graph is now much smaller
(as  expected,  based  on  the  larger  replication  count),  and  the  new  model  for  synchronization  loss 
accurately  captures  the  measured  data.  At  this  point  all  but  one  of  the  determination  coefficients 
are  0.97  or  better,  and  the  remaining  coefficient  (resource  contention  as  a  function  of  P)  is  0.93. 

The  user  can  use  modgen  to  verify  the  accuracy  of  the  model  by  having  the  tool  use  the  model 
to calculate various measured data points. Once again, we use R^2 as a metric to define how well
the model captures the data points used in verification. The user can also use the model to predict
values  for  data  points  that  were  not  measured. 


4  Using  Lost  Cycles  Models 

In this section we demonstrate the practical application of our modeling technique to solving
problems in parallel programming.* Our first example uses performance models to estimate the optimal
distribution of workload in a heterogeneous workstation environment. Our second example
illustrates the use of our modeling method for predicting the effect of factors other than the number of
processors  and  the  data  size. 
Figure 3: Cholesky: overhead models for Sparc LX

4.1  Load  Distribution  on  a  Network  of  Workstations 

Our first example uses models of application running time on a network of homogeneous
workstations to predict the optimal load distribution for that application on a network of heterogeneous
workstations. Our example application is Cholesky factorization implemented under PVM. Our
network consists of Sun Sparcstation 1 and LX workstations. The problem is this: if, at runtime,
the program has available an equal number of Sun 1's and LX's, how should the application load
be distributed among the workstations?^

One  obvious  solution  is  to  distribute  the  load  in  proportion  to  the  estimated  speed  of  the 
processors  in  the  machines,  which  is  approximately  2/1.  Another  approach  is  to  use  a  sequential 
implementation  of  Cholesky  to  benchmark  the  machines  on  this  application,  and  distribute  the  load 
in  proportion  to  the  speed  of  the  machines  when  executing  this  application,  which  is  roughly  4/1. 
(Alternatively,  we  can  use  the  pure  computation  category  as  measured  by  our  predicate  profiler  to 
compute this ratio.) Both of these approaches ignore parallel overheads, however. Our solution is
to  develop  independent  models  for  the  parallel  application  on  Sun  1  and  LX  workstations,  and  use 
these  models  to  calculate  the  ratio  of  the  speed  of  the  machines  on  the  parallel  application  as  a 
function  of  problem  size  and  number  of  processors. 

We  first  developed  models  for  the  running  time  of  Cholesky  on  a  homogeneous  set  of  worksta¬ 
tions  for  both  the  Sun  1  and  LX  machines,  varying  p  from  2  to  8,  and  n  (the  size  of  one  dimension 
of  the  matrix)  from  350  to  500.  The  overhead  categories  supported  by  our  measurement  system 
under  PVM  include  load  imbalance,  communication  loss,  computation  time,  and  miscellaneous 
overhead  related  to  contention  and  scheduling  effects  due  to  multiprogramming.  The  models  for 
these  individual  overhead  categories  as  produced  by  the  toolkit  for  each  type  of  workstation  are 
presented  in  figures  3  and  4.  From  these  figures  we  can  see  that  each  overhead  category  exhibits  a 
similar  trend  on  the  two  machines,  except  for  miscellaneous  overhead,  which  increases  dramatically 
with  an  increase  in  processors  on  the  Sparc  1,  but  not  on  the  LX. 

Figure 4: Cholesky: overhead models for Sparc 1

Figure 5: Load distribution ratios

Figure 6: Load imbalance (LI) using different load distributions

^It would be straightforward to extend this example to an unequal number of workstations of each type.

In this example we are interested in the proper distribution of work between the two classes of
machines, rather than in tuning the program to reduce a particular source of overhead on a particular
machine. Thus, we are interested primarily in minimizing load imbalance on a heterogeneous
network.  Since  we  are  dealing  with  only  two  classes  of  machines,  we  can  express  the  distribution 
of  work  as  a  ratio  (workload  for  LX/workload  for  Sparc  1).  We  can  calculate  this  ratio  using  only 
the  pure  computation  time  measured  during  an  execution  of  the  application  or  using  the  overhead 
models  produced  by  the  toolkit.  In  figure  5,  we  present  both  ratios  as  a  function  of  the  number 
of processors, while fixing n = 450. In comparing the load distribution suggested by the ratio of
computation  times  for  the  application  to  the  ratio  provided  by  the  models,  we  see  that  the  models 
suggest  a  comparable  ratio  for  a  very  small  number  of  processors,  but  a  much  higher  ratio  of  work 
for  the  LXs  (8/1)  when  p  =  8. 
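
The model-based ratio can be illustrated with a minimal sketch. It assumes that each homogeneous model predicts running time T(n, p), and that the workload ratio (LX/Sparc 1) is taken as the ratio of the predicted running times (Sparc 1/LX). The functions `time_lx` and `time_sparc1`, and all of their coefficients, are invented placeholders rather than the models produced by the toolkit.

```python
from typing import Callable, Tuple

TimeModel = Callable[[int, int], float]


def time_lx(n: int, p: int) -> float:
    """Hypothetical running-time model for LX machines (made-up coefficients)."""
    return 2.0e-8 * n**3 / p + 0.01 * p


def time_sparc1(n: int, p: int) -> float:
    """Hypothetical model for Sparc 1 machines; overhead grows faster with p."""
    return 8.0e-8 * n**3 / p + 0.03 * p**2


def load_ratio(fast: TimeModel, slow: TimeModel, n: int, p: int) -> float:
    """Workload ratio (fast class / slow class) implied by predicted times."""
    return slow(n, p) / fast(n, p)


def split_work(total_units: int, ratio: float) -> Tuple[int, int]:
    """Split work ratio:1 between two classes with equal machine counts."""
    fast_units = round(total_units * ratio / (ratio + 1.0))
    return fast_units, total_units - fast_units


if __name__ == "__main__":
    for p in (2, 4, 8):
        ratio = load_ratio(time_lx, time_sparc1, 450, p)
        print(p, round(ratio, 2), split_work(100, ratio))
```

With these placeholder models the ratio grows with p because the slower class accumulates overhead faster, which is the qualitative effect described above; the actual values depend entirely on the fitted models.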

To verify this result, we executed the application using both ratios on a network of 4 Sun 1s and
4 LXs. The workload distribution provided by the models produced an improvement in running
time of 25-40%, depending on the data size. The reduction in load imbalance as a function of data
size is shown in figure 6. As seen in the figure, load imbalance decreases by more than 50% for large
data sizes when using the models generated by the toolkit to determine the workload distribution.

It  is  worth  noting  that  the  models  for  a  homogeneous  network  of  workstations  are  fairly  accurate 
at  predicting  the  running  time  of  the  application,  and  can  be  used  to  solve  the  problem  of  workload 
distribution,  even  though  we  did  not  fully  explain  all  the  cycles  lost  to  miscellaneous  overhead. 
Whether the dramatic increase in this overhead category on the Sun 1s that results from an increase
in processors is attributed to contention or scheduling effects is not important for the purposes of
solving  the  problem  of  workload  distribution  with  our  modeling  approach. 


4.2  Computation  Time  in  Speculative  Search 

In earlier work [Crowl et al., 1994] we studied the performance of several alternative implementations
of subgraph isomorphism on a variety of shared-memory multiprocessors. In that work we
performed over 37,000 experiments to show how the number of isomorphisms sought, the number
of available processors, and the underlying architecture all affect the choice of an efficient
parallelization. Here we use our modeling technique to predict the amount of time processors spend
searching for a solution on a KSR1.


Our parallel program is based on Ullman’s sequential tree-search algorithm for subgraph
isomorphism [Ullman, ...], which successively postulates a mapping from each vertex in the small
graph to a vertex in the large graph. Each mapping constrains the possible mappings for succeeding
vertices of the small graph. The process of postulating mappings for vertices in the small graph
continues until an isomorphism is found, or until the constraints preclude such a mapping, at which
point the algorithm postulates a different mapping for an earlier vertex.


There are many ways to exploit parallelism in the search for an isomorphism. In this example
we focus on the coarsest granularity of parallelism, which occurs in the tree search itself. We search
each  subtree  of  the  root  node  in  parallel,  using  depth-first,  sequential  search  at  the  remaining  levels. 
This  type  of  parallelism  is  speculative,  in  that  we  might  not  need  to  search  multiple  subtrees  of 
the  root  in  parallel  in  order  to  find  a  solution  quickly.  However,  if  the  solution  space  is  sparse,  we 
might  benefit  from  searching  several  subtrees  in  parallel. 
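
The structure of this search can be sketched as follows. This is a simplified illustration, not the implementation measured here: the only filter is an edge-consistency check against already-mapped vertices, and the root subtrees are visited in a sequential loop where a parallel version would assign them to different processors. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

Graph = Dict[int, Set[int]]  # vertex -> set of adjacent vertices (undirected)


def consistent(small: Graph, large: Graph, mapping: Dict[int, int],
               u: int, v: int) -> bool:
    """True if mapping u -> v preserves every small-graph edge to mapped vertices."""
    return all(x in large[v] for w, x in mapping.items() if w in small[u])


def extend(small: Graph, large: Graph, order: List[int],
           mapping: Dict[int, int], used: Set[int]) -> Optional[Dict[int, int]]:
    """Depth-first, sequential search below the current partial mapping."""
    if len(mapping) == len(order):
        return dict(mapping)
    u = order[len(mapping)]
    for v in sorted(large):
        if v in used or not consistent(small, large, mapping, u, v):
            continue  # pruned by the filter
        mapping[u] = v
        used.add(v)
        found = extend(small, large, order, mapping, used)
        if found is not None:
            return found
        del mapping[u]  # backtrack and postulate a different mapping
        used.discard(v)
    return None


def search_root_subtree(small: Graph, large: Graph, order: List[int],
                        root_image: int) -> Optional[Dict[int, int]]:
    """Search one subtree of the root: the first small vertex is pinned."""
    return extend(small, large, order, {order[0]: root_image}, {root_image})


def find_isomorphism(small: Graph, large: Graph) -> Optional[Dict[int, int]]:
    order = sorted(small)
    # Each iteration is an independent task; a tree-parallel version would
    # search these root subtrees speculatively on separate processors.
    for root_image in sorted(large):
        result = search_root_subtree(small, large, order, root_image)
        if result is not None:
            return result
    return None
```

Because each call to `search_root_subtree` shares no state with the others, the work done in subtrees that turn out not to be needed is exactly the speculative computation discussed above.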

In  our  experiments  input  graphs  are  randomly  generated  from  four  parameters:  the  size  of  the 
small  and  large  graphs,  and  the  probability  for  each  graph  that  a  given  pair  of  vertices  will  be  joined 
by an edge. We use S for the number of vertices in the small graph, L for the number of vertices
in the large graph, and s and l, respectively, for the edge probabilities in each of the graphs. The
total number of leaves in the problem space tree is

$$\frac{L!}{(L-S)!}.$$

This number is very large; for $S = 32$ and $L = 128$ (the size of our experiments), there are
approximately $4 \times 10^{65}$ leaves. To avoid searching among all the possible leaf nodes, we apply a set
of  filters  at  each  node  in  the  search  tree.  These  filters  prune  the  search  space  enough  to  make  the 
problem  tractable. 
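
The size of the unpruned search space can be checked with a short calculation. The sketch assumes that each leaf corresponds to an assignment of the S small-graph vertices to distinct large-graph vertices, so that the leaf count is L!/(L-S)!; the function name is hypothetical.

```python
import math


def leaf_count(small_vertices: int, large_vertices: int) -> int:
    """Number of ways to map S small vertices to distinct large vertices."""
    return math.perm(large_vertices, small_vertices)


if __name__ == "__main__":
    leaves = leaf_count(32, 128)
    print(f"{leaves:.2e}")  # on the order of 10^65
```

Under this assumption the count for the experiment sizes used here agrees with the approximate figure quoted above, which is why exhaustive search is out of the question without filters.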

In  our  earlier  work,  we  characterized  the  input  graphs  (and  the  difficulty  of  finding  a  solution) 
using  a  density  function.  We  define  the  density  d  of  the  search  space  to  be  the  probability  that  a 
randomly-selected  leaf  will  be  a  solution: 

$$d = (1 - s + sl)^{S^2},$$

where $S^2$ is the number of potential edges in the small graph and $1 - s + sl$ is the probability
that  a  given  potential  edge  will  be  successfully  matched.  Our  goal  here  is  to  discover  how  the  time 
processors  spend  looking  for  a  solution  varies  with  density  when  searching  for  a  single  isomorphism. 

We  attempted  to  use  lost  cycles  analysis  to  develop  a  model  of  pure  computation  for  the  tree- 
parallel  implementation  of  subgraph  isomorphism  as  a  function  of  density.  Our  initial  attempt  was 
not successful, however, because we could not find an analytic form for pure computation that was
linear  in  its  coefficients  and  produced  a  good  fit  to  the  measured  data.  This  empirical  evidence 
that  computation  time  is  not  a  simple  function  of  density  has  an  intuitive  explanation:  density  only 
captures  the  prevalence  of  solutions  in  the  leaf  nodes  of  the  tree,  whereas  computation  time  depends 
greatly  on  the  effectiveness  of  the  filters  used  for  pruning  (for  which  we  have  no  analytic  formula). 
Further  evidence  for  the  unsuitability  of  density  as  a  predictor  of  computation  time  comes  from  an 
examination of the experimental data, wherein different pairs of s and l (the edge probabilities for
the  two  graphs)  can  produce  the  same  density,  but  still  result  in  very  different  computation  times. 
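
The observation that distinct edge-probability pairs can share a density follows directly from the definition d = (1 - s + sl)^(S^2). The sketch below, with hypothetical function names, computes the density and solves for the value of l that reproduces a given density at a different s.

```python
import math


def density(s: float, l: float, small_vertices: int) -> float:
    """d = (1 - s + s*l) ** (S**2), the probability that a random leaf is a solution."""
    return (1.0 - s + s * l) ** (small_vertices ** 2)


def l_for_density(s: float, d: float, small_vertices: int) -> float:
    """Invert the density formula for l, given s and a target density d."""
    per_edge = d ** (1.0 / small_vertices ** 2)  # per-edge match probability
    return 1.0 - (1.0 - per_edge) / s


if __name__ == "__main__":
    d = density(0.1, 0.9, 32)
    # A different (s, l) pair with the same per-edge match probability of 0.99.
    print(math.isclose(d, density(0.2, 0.95, 32), rel_tol=1e-9))
    print(l_for_density(0.2, d, 32))
```

For S = 32, the pairs (s, l) = (0.1, 0.9) and (0.2, 0.95) both give a per-edge match probability of 0.99 and hence the same density, even though the graphs they generate differ considerably in how well the filters can prune.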
Figure 7: Subgraph Isomorphism: Computation time for various s probabilities (plot title: Running times, p = 14, l = 0.8)

A  closer  examination  of  the  experimental  data  shows  that  computation  time  can  be  consistently 
predicted as a function of s and l. Therefore, rather than model computation time as a function
of density, we modeled it as a function of s and l (which are used to calculate density). In our
experiments we varied s between 0.05 and 0.2, l between 0.8 and 0.95, and p between 8 and 14. ^
The best fit of computation time as a function of s is shown in Figure 7. The fit for computation
time as a function of l is also very good.
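
A model that is linear in its coefficients can be fitted by ordinary least squares. The following sketch shows the idea for a model in s and l; the basis functions, the synthetic data, and the use of R^2 as the goodness-of-fit measure are all assumptions made for illustration and do not reproduce the toolkit's model forms or measured data.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]  # (s, l)
Basis = Sequence[Callable[[Point], float]]


def fit_linear_model(basis: Basis, xs: Sequence[Point],
                     ys: Sequence[float]) -> List[float]:
    """Least-squares coefficients for y ~ sum_j c_j * basis[j](x)."""
    k = len(basis)
    rows = [[f(x) for f in basis] for x in xs]
    # Augmented normal equations [A^T A | A^T y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def r_squared(basis: Basis, coeffs: Sequence[float],
              xs: Sequence[Point], ys: Sequence[float]) -> float:
    """Fraction of the variance in ys explained by the fitted model."""
    preds = [sum(c * f(x) for c, f in zip(coeffs, basis)) for x in xs]
    mean = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical basis: constant, s, and l terms.
basis = [lambda x: 1.0, lambda x: x[0], lambda x: x[1]]
xs = [(s, l) for s in (0.05, 0.10, 0.15, 0.20) for l in (0.80, 0.85, 0.90, 0.95)]
ys = [2.0 + 30.0 * s - 1.5 * l for s, l in xs]  # synthetic, noise-free data
coeffs = fit_linear_model(basis, xs, ys)
```

Replacing the basis with other candidate terms and comparing the resulting goodness of fit is one simple way to choose among alternative model forms.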

This  example  illustrates  several  strengths  of  our  approach.  First,  we  were  able  to  model  com¬ 
putation  time  as  a  function  of  variables  other  than  p,  the  number  of  processors,  and  n,  the  size 
of  the  input  data.  Second,  we  were  able  to  use  our  modeling  technique  to  discover  that  density 
was an inappropriate measure for predicting computation time, but that the values used to calculate
density (s and l) were an appropriate determining factor of computation time. Third, we were able
to create a multivariate model of computation time (as a function of s and l), which was required
to  capture  the  complexity  of  the  input.  Finally,  we  were  able  to  develop  an  accurate  model  of 
computation  time  (and  a  good  model  of  running  time)  without  an  analytic  form  for  the  efficacy  of 
filters. 


5  Conclusion 

Performance  prediction  can  be  a  very  useful  tool  for  parallel  programmers,  offering  assistance  in 
the difficult task of tuning and evaluating a program’s performance and scalability. Unfortunately,
a  number  of  obstacles  have  prevented  programmers  from  routinely  building  performance  models 
of  their  programs.  These  obstacles  include  the  difficulty  of  measuring  the  many  potential  sources 
of  inefficiency  in  parallel  systems,  the  challenge  of  describing  those  overheads  mathematically,  and 
the  requirement  to  design  experiments  that  yield  models  of  suitable  accuracy. 

^We limited the range of values for s and l to keep the time spent on experiments manageable. This application
depends on a random number generator to create the graphs, so each experiment has to be replicated many times
with different seeds to ensure average case behavior. Much larger values of s or smaller values of l would significantly
increase  the  difficulty  of  the  search,  and  the  running  time  of  each  experiment.  In  any  case,  the  values  we  use  here 
cover  the  range  between  sparse  and  dense  solution  spaces  in  our  earlier  paper. 
The Lost Cycles Toolkit represents an attempt to make performance prediction available to
parallel programmers. It does so by automating as much as possible of the performance modeling
process. The toolkit automates experimental design, performance measurement, univariate modeling,
and multivariate modeling. It also provides feedback to the user on the quality of the resulting
model, and allows the user to trace inaccuracies back to the component models and performance
data. This potential for interactive model refinement means that the tools can aggressively automate
processes like model selection, since the user is alerted whenever a tool’s decision or assumptions
are inaccurate.

In this paper we have described the methods and tools comprising the Lost Cycles Toolkit, and
we have shown how they are used in practice. In addition, we have shown examples describing
the concrete benefits to be derived from using lost cycles analysis to build performance models
of applications. These examples demonstrate the utility of the toolkit, but several caveats are in
order.

•  Accurate  models  require  many  experiments.  The  toolkit  provides  an  instrumentation  package 
that,  at  least  for  Fortran  programs,  hides  the  complexity  of  instrumentation,  and  also  guides 
the  user  in  the  judicious  selection  of  experiments,  and  automates  (via  program  scripts)  the 
experiment  process.  Nonetheless,  the  experiment  phase  can  be  time  consuming  and  resource 
intensive,  and  may  not  be  practical  for  long-running  applications. 

•  The  set  of  default  models  embedded  in  the  toolkit  for  each  overhead  category  may  not  accu¬ 
rately  characterize  the  behavior  of  a  particular  application,  and  the  user  may  not  have  the 
expertise  required  to  suggest  an  alternative  model  based  on  an  analysis  of  the  program.  In 
the  worst  case,  the  toolkit  can  automatically  select  the  best  choice  from  the  default  models, 
or  the  user  can  suggest  an  alternative  based  on  a  visual  inspection  of  the  experiment  data 
points.  We  are  considering  adding  source  code  analysis  to  the  toolkit  to  assist  the  user  in 
selecting  default  models  for  overhead  categories. 

•  The  predicate  profiler  may  not  be  able  to  capture  all  the  categories  of  interest  in  every  environ¬ 
ment.  On  the  KSR  we  exploit  performance  monitoring  hardware  to  isolate  communication 
and  contention  costs;  on  the  SGI  machine  we  combine  these  two  categories  and  measure 
them  indirectly  by  comparing  actual  execution  time  to  emulated  execution  time  on  a  perfect 
memory  system  (using  the  pixie  tool). 

Despite  these  caveats,  we  have  found  the  Lost  Cycles  Toolkit  to  be  useful  for  performance 
modeling.  In  addition,  a  number  of  improvements  are  under  way  or  in  the  planning  stage.  The 
extension  of  lost  cycles  models  across  performance  boundaries  created  by  system  effects  (cache  and 
physical  memory  sizes)  represents  an  open  problem.  We  have  used  lost  cycles  analysis  on  runtime 
factors  other  than  n  and  p,  but  more  work  is  necessary  to  allow  automatic  model  selection  for  those 
factors. Finally, the extension of the lost cycles methods to portions of programs, rather
than complete applications, seems a promising approach to the problem of application restructuring.